Abstract: Clustering analysis is one of the most significant tools for studying the distribution of data. The aim of clustering is to find intrinsic structures in data and organize them into meaningful subgroups for further study and analysis. Clustering algorithms make certain assumptions about the relationships among the data objects to which they are applied, and cluster formation is driven by a similarity measure. Because different clustering algorithms rely on different notions of similarity, they can produce different clusters from the same data set. K-Means clustering is one such technique, used to give structure to unstructured data so that valuable information can be extracted. In this paper we study the implementation of the K-Means clustering algorithm in a distributed environment using Apache Hadoop. The main focus of the implementation is the design of the Mapper and Reducer routines, which is discussed in the paper. The steps involved in executing the K-Means algorithm are also described, to serve as a guide for practical implementations.
Keywords: Data mining, clustering analysis, K-means algorithm, Hadoop, MapReduce.
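
To make the Mapper and Reducer roles concrete, the following is a minimal sketch of one K-Means iteration, written in plain Python rather than against the Hadoop API. It is not the implementation described in this paper: the function names (kmeans_map, kmeans_reduce, kmeans_iteration, kmeans) are hypothetical, Euclidean distance is assumed as the similarity measure, and the shuffle phase is simulated with an in-memory dictionary.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centroids):
    """Mapper (hypothetical name): emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def kmeans_reduce(points):
    """Reducer (hypothetical name): average the points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One simulated MapReduce round: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid with no assigned points keeps its previous position.
    return [kmeans_reduce(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


def kmeans(points, centroids, max_iter=20):
    """Repeat rounds until the centroids stop changing or max_iter is reached."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

In a Hadoop job, each round would typically run as a separate MapReduce job, with the current centroids made available to every Mapper; the sketch above keeps everything in memory only to show the data flow.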